**************************
*NES emulation discussion*
**************************

Brad Taylor (big_time_software@hotmail.com)

3rd release: September 6th, 2003

This document is designed not only to be a guide to programmers who are writing an NES emulator, but also for those individuals who have some concern for efficiency and/or emulator throughput (or want to improve these things). Granted, with today's several-gigahertz processors (or otherwise, super-fast computers), general concern for efficiency seems to be going the way of the dodo (i.e., extinct). However, writing an emulator is by no means a trivial task. Even seasoned programmers will find that writing a blazingly fast NES emulator is pretty difficult, simply because of the way NES games use the hardware.

My opinion of a blazingly fast emulator would have to be one which can run at full speed on a 486DX2/66 computer. Unfortunately, this level of optimization goes far beyond the extents that most programmers are willing to go to (i.e., hand-coding assembly subroutines) to engineer something as complex as an emulator, and this guide doesn't expect the coder to be this dedicated to efficiency, anyway. Instead, this guide gives hints and tips on general optimizations, but also points out further optimizations possible for assembly and low-level coders. Additionally, it gears toward optimizations specifically for modern generic personal computers (PCs), so keep this in mind if you're writing your emulator for a different type of computer.

This document has undergone a major overhaul since the second release. The reason it has been reworked is that I have learned a great deal more about coding efficiency on x86-based platforms. Compared to this version, the older version of this document, with some of its optimization "tips", is questionable at best. The featured addition to this document is an MMX-coded example of the heart of a very fast playfield renderer.

topics discussed
----------------

	General PPU emulation
	Pixel rendering techniques
	Merging playfield & object pixels
	Frame store optimizations
	CPU emulation

references to other works of mine related to this document
----------------------------------------------------------

* 2C02 technical operation *
discusses the low-level operation of the PPU in great detail.

* PPU addressing *
a visual & written documentation of the operation of the PPU's scroll counters.

All documents can be found at http://nesdev.parodius.com.

+---------------------+
|General PPU emulation|
+---------------------+

Most likely, the key to your emulator's performance will be the speed at which it can render NES graphics. It's pretty easy to write a slow PPU render engine, since overall there's a good deal of work that has to be done. However, emulating the NES's PPU is not very difficult. What makes accurate emulation of it difficult is the trickery various NES games use to achieve special video effects (like split-screen scrolling) otherwise not possible by "clean" or conventional means. In reality, all these "tricks" are simply accomplished by writing to the appropriate PPU (or related) registers at the right time during the rendering of a frame (picture).

On a hardware level, the CPU & PPU in the NES run simultaneously. This is why a game can be coded to make a write out to a PPU register at a certain time during a frame, with the result that the (on-screen) effect occurs at a specific location on the screen.
Thus, the first instinct one has when writing an NES emulator is to execute the CPU & PPU engines alternately on every (NES) clock cycle. Doing this will give very accurate emulation, but it will also be VERY processor intense (since on a hardware level, the PPU does much more work per clock cycle). As a result, emulators coded this way turn out to be the slowest ones.

PPU info
--------

NES graphics consist of a single scrollable playfield, and 64 objects/sprites. The screen resolution is 256*240 pixels, and while games can control the graphics on a per-pixel basis, it is usually avoided since it's pretty difficult. Instead, the PPU makes displaying graphics easier for the programmer by dividing the screen up into tiles, which index an 8*8 pixel bitmap to appear in that particular spot. Each object defines 1 or 2 tiles to be displayed at a randomly accessible x/y coordinate on the screen.

There are also 2*4 color palettes in the PPU that bitmap data can refer to (playfield & object bitmap data each have their own dedicated palettes). Each palette has 3 indexable colors, as tile bitmaps only consist of 2 bits per pixel (the 00 combination is considered transparency). A single transparency color palette register is also defined, and is only used as the on-screen color when overlapping pixels (due to objects being placed on the playfield) of all playfield/object pixels are defined as transparent.

As graphics are rendered (as described in the "2C02 technical operation" document), the name tables are accessed sequentially to reference a tile's bitmap (as described in the "PPU addressing" document), which gets used as the pixel data for the area of the screen that the name table index entry corresponds to (plus the scroll register values). Attribute tables, which are laid out the same way that name tables are (except with lower resolution- 1 attribute entry represents a 2*2 cluster of on-screen tiles), determine the palette select value for the group of tiles to use (1 of 4). Object attribute memory (sprite RAM, or "OAM", which contains private tile index and palette select information) is evaluated every single scanline (y-coordinate entries are examined), and in-range objects have their tile bitmaps loaded into the PPU in between scanlines. The contents are then merged with the playfield's pixel data in real time.

Accurate & efficient PPU emulation
----------------------------------

For the most part, PPU operations are linear and sequential, making them easy to design algorithms for. By rendering playfields and objects on a per-tile basis, emulation can be made quick & easy. However, games that use special effects (mid-frame PPU trickery) require special handling, which in turn complicates algorithm design.

By implementing a clock cycle counter in the CPU core, it is possible for emulated PPU hardware to know exactly when a read/write is occurring to a PPU-related register (or otherwise, a register which is going to change the rendering of graphics from that point on). Therefore, when a write to a PPU register occurs, the PPU engine can determine if the write is going to change the way the picture is going to be rendered, and at the exact clock cycle (which really translates into a screen position).
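To make this concrete, here's a minimal C sketch of one way to organize it: the CPU core timestamps each PPU register write into a per-frame list, and the render routine catches the picture up to each write before applying it. render_up_to(), apply_ppu_write() and the cycles-per-frame constant are placeholders, not fixed names.

####################################
#include <stdint.h>

#define MAX_EVENTS       1024
#define CPU_CC_PER_FRAME 29781u  /* approximate NTSC CPU cycles per frame */

typedef struct {
    uint32_t cpu_cc;  /* CPU clock cycle of the write, relative to last VINT */
    uint16_t addr;    /* PPU-related register written */
    uint8_t  value;   /* value written */
} PpuEvent;

static PpuEvent events[MAX_EVENTS];
static int      event_count;

/* placeholder helpers implemented elsewhere in the emulator */
void render_up_to(uint32_t cpu_cc);              /* renders pixels up to this CC */
void apply_ppu_write(uint16_t addr, uint8_t v);  /* updates emulated PPU state   */

/* called by the CPU core whenever a PPU-mapped register is written */
void log_ppu_write(uint32_t cpu_cc, uint16_t addr, uint8_t value)
{
    if (event_count < MAX_EVENTS)
        events[event_count++] = (PpuEvent){ cpu_cc, addr, value };
}

/* called once per frame: render up to each logged write, apply it, continue */
void render_frame(void)
{
    for (int i = 0; i < event_count; i++) {
        render_up_to(events[i].cpu_cc);
        apply_ppu_write(events[i].addr, events[i].value);
    }
    render_up_to(CPU_CC_PER_FRAME);  /* finish the remainder of the image */
    event_count = 0;
}
####################################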
For example, say the CPU engine is executing instructions. Then, on clock cycle 13000 (relative to the last VINT), a write to the PPU's scroll registers is made (which causes a split-screen effect). First, the PPU translates 13000 CCs into X/Y coordinates (in this case, on-screen scanline 93, roughly pixel #126; the equations to do these calculations will be revealed later). Ideally*, all pixels before this point will now be rendered to a buffer, using the data in the PPU registers prior to the write. Now the screen area before the write occurred has been rendered accurately, and the screen will progressively continue to be updated in this fashion as more mid-frame writes occur. If no more occur, when the CPU arrives at the # of clock cycles per frame, the rest of the image (if any) can be rendered.

* As will be discussed in the following "the virtual frame buffer" section, maintaining a "stack", or list, of mid-frame PPU changes (which affect how successive rendering in the frame occurs), and only executing a PPU render routine once per frame (which then processes the stack of mid-frame writes), is a more efficient way of organizing CPU/PPU program control in your emulator.

Knowing when to update the screen
---------------------------------

The following list describes PPU status registers/bits that, if a game modified/changed them mid-frame, would change the way the rest of the frame is rendered. O = update objects, P = update playfield.

O	object enable bit
O	left column objects clipping
O	8/16 scanline objects
O	active object pattern table
O	pattern table bankswitch (which affects the active object pattern table)
PO	color emphasis bits
PO	black & white/color select
P	playfield enable bit
P	left column playfield clipping
P	scroll registers
P	X/Y name table selection
P	name table bankswitch (hypothetical)
P	active playfield pattern table
P	pattern table bankswitch (which affects the active playfield pattern table)

Note that any PPU-mapped memory (which means name, pattern, color & palette tables) can only be changed while objects & the playfield are disabled (unless cartridge hardware provides a way to do this through the CPU memory map). Since the screen is blanked to black during this time (regardless of the current transparency color the palette is programmed with), these writes do not affect how the screen is rendered.

Collision flag
--------------

Games without hardware for scanline counting often poll this bit to find out when to make a write out to a PPU register which will result in a split screen, or a pattern table swap/bankswitch. The collision flag is set when the first non-transparent pixel of object 0 collides with a playfield pixel that is also non-transparent. Since the screen position of the first colliding pixel can be determined at any time (and therefore, the exact CPU clock cycle at which the collision is expected to occur), when a game requests the status of this flag for the first time, a routine in the PPU engine can calculate at which clock cycle this flag will be set (calculations will be shown later). Subsequent requests for the collision flag's status would then only require the engine to compare the current CPU clock cycle to the calculated collision clock cycle. Whenever a mid-frame change occurs (whether it affects the playfield or objects), the clock cycle at which the collision flag will go off will have to be recalculated (unless it has already gone off).
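A sketch of this comparison-based scheme in C, where find_first_collision_cc() stands in for the actual object-0-versus-playfield scan:

####################################
#include <stdint.h>
#include <stdbool.h>

static bool     collision_known;  /* has the collision CC been calculated yet? */
static uint32_t collision_cc;     /* CPU CC at which the flag sets             */

uint32_t find_first_collision_cc(void);  /* placeholder: scans OBJ 0 vs. playfield */

/* first request performs the expensive scan; later requests are one compare */
bool read_collision_flag(uint32_t cpu_cc)
{
    if (!collision_known) {
        collision_cc = find_first_collision_cc();
        collision_known = true;
    }
    return cpu_cc >= collision_cc;
}

/* a mid-frame PF/OBJ change invalidates the cached value, unless the flag
   has already gone off by the time of the change */
void note_midframe_change(uint32_t cpu_cc)
{
    if (collision_known && cpu_cc < collision_cc)
        collision_known = false;
}
####################################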
MMC3 IRQ timer
--------------

The MMC3's IRQ timer relies on the toggling of one of the PPU's address lines. Basically, its counting operation proceeds at a more or less constant rate (meaning it is predictable). However, when the PPU bus is disabled (via disabling the playfield & objects, or during the V-blank period), the counter must quit counting. Manual toggling of PPU address bits during this time will have to be intercepted, and the IRQ timer advanced appropriately.

CPUCC to X/Y coordinate equations
---------------------------------

The PPU renders 3 pixels in one CPU clock. Therefore, by multiplying the CPU CC figure by 3, we get the total amount of pixels that have been rendered (including non-displayed ones) since the VINT. 341 pixels are rendered per scanline (although only 256 are displayed). Therefore, by dividing PPUCC by this, we get the # of completely rendered scanlines since the VINT. 21 blank scanlines are rendered before the first visible one is displayed. So, to get a scanline offset into the actual on-screen image, we simply subtract the number of non-displayed scanlines. Note that if this yields a negative number, the PPU is still in the V-blank period.

PPUCC		= CPUCC * 3
Scanline	= PPUCC div 341 - 21;	Y-coordinate
PixelOfs	= PPUCC mod 341;	X-coordinate

CPUcollisionCC	= ((Y+21)*341+X)/3

Note that if the PixelOfs equation yields a number higher than 255, the PPU is in the H-blank period.
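In C, the equations translate directly (a small sketch; signed results are used so the V-blank and H-blank conditions noted above remain visible):

####################################
#include <stdint.h>

typedef struct { int32_t scanline; int32_t pixel; } XY;

/* CPU clock cycle (relative to the last VINT) -> screen position */
XY cpucc_to_xy(uint32_t cpu_cc)
{
    uint32_t ppu_cc = cpu_cc * 3;               /* 3 PPU clocks per CPU clock   */
    XY xy;
    xy.scanline = (int32_t)(ppu_cc / 341) - 21; /* negative => still in V-blank */
    xy.pixel    = (int32_t)(ppu_cc % 341);      /* > 255    => in H-blank       */
    return xy;
}

/* screen position -> CPU clock cycle (e.g. the expected collision CC) */
uint32_t xy_to_cpucc(uint32_t x, uint32_t y)
{
    return ((y + 21) * 341 + x) / 3;
}
####################################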
+--------------------------+
|Pixel rendering techniques|
+--------------------------+

3 rendering techniques are described in this section. They are all real-time techniques. An earlier, unreleased draft of this document discussed a tile caching-based rendering solution. However, tile caching quickly loses its effectiveness for those games that use mid-frame (or even mid-scanline) trickery to change character sets, or even palette values. Instead, this release contains a better & faster method of rendering using MMX instructions, which is why I personally recommend this solution over any other one discussed here.

Basic
-----

This method, which is the most straightforward, is to store the PPU's 52-color matrix as constant data in the VGA palette registers (or otherwise, other palette registers used for an 8-bit per pixel graphics mode). Before a pixel can be drawn, the pixel color is calculated (via pattern table & palette select data). The PPU palette registers are looked up in some way or another, and the contents of the palette register element are written to a virtual frame buffer as the pixel data. This technique is the easiest to implement, and provides the most accurate PPU emulation. However, since every pixel drawn requires an independent palette look-up, this method is naturally very slow.

One way to speed up this rendering style is to create a palette table designed for looking up 2 or more pixels simultaneously. The advantages are clear: you could easily shave a lot of time (close to half, with a 2-simultaneous-color lookup) off playfield rendering. The disadvantages are that the lookup table grows from 2^4*2=32 bytes for a 2-pixel lookup, to 2^8*4=1024 bytes for a 4-pixel lookup (and 1024 bytes makes this table not suitable for staying in the CPU cache of an older CPU). Each of the palette's 4 colors is also mirrored across these tables, and this has to be maintained. Since I've never tried this optimization technique, I can't tell you how effective it is (or when it stops being effective).

Another way to increase the speed of this approach is to change the bit ordering of the pattern tables stored in memory to favor this rendering algorithm. For example, store the bitmap for any scanline of a tile as 8 packed 2-bit pixels, instead of the default format of 2 separate 8-bit planes. By doing this, it will allow the tile rendering routine to easily extract the 2-bit number for indexing the 4-color palette associated with the particular tile. Of course, by changing the pattern tables, whenever pattern table memory is read or written, the format of the data will have to be converted. Since this happens much less often (even in games that use CHR-RAM), it's a good idea.
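A sketch of the 2-pixel simultaneous lookup described above, assuming the pattern data has already been repacked into 2-bit pixels with the leftmost pixel in the most significant bits (the table layout and names here are illustrative):

####################################
#include <stdint.h>
#include <string.h>

static uint16_t pair_lut[16];  /* 2^4 entries * 2 bytes = 32 bytes total */

/* rebuild whenever the 4-color palette behind it changes (mid-frame too) */
void rebuild_pair_lut(const uint8_t pal[4])
{
    for (int i = 0; i < 16; i++) {
        uint8_t left  = pal[(i >> 2) & 3];    /* high 2 bits = left pixel  */
        uint8_t right = pal[i & 3];           /* low 2 bits  = right pixel */
        pair_lut[i] = (uint16_t)(left | (right << 8));  /* little-endian   */
    }
}

/* draw 8 packed 2-bit pixels as four 2-pixel (16-bit) stores */
void draw8(uint8_t *dst, uint16_t packed)
{
    for (int i = 0; i < 4; i++) {
        uint16_t two = pair_lut[(packed >> (12 - i * 4)) & 15];
        memcpy(dst + i * 2, &two, 2);  /* compiles to a single 16-bit store */
    }
}
####################################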
VGA palette register indexed
----------------------------

This method involves programming the VGA palette registers to reflect the direct values the PPU palette registers contain. The VGA palette would be divided into 64- 4 color palettes. When sequential horizontal pixels are to be drawn, a large (32-bit or more) pattern table data fetch can occur (pixels for pattern table tiles should be organized in memory so that the 2 bits for each horizontally sequential pixel are stored in 8-bit increments). This chunk of fetched pixel data can then be masked (so that other pixel data from the chunk is not used), an indexed "VGA palette select" value can be added to it, and the result can then be written out to the virtual frame buffer in one single store operation.

The "VGA palette select" value is fetched via the VGA palette select table, which corresponds to the 8 classic PPU palettes (4*2 elements in the table; therefore, a tile's attribute data (either PF or OBJ) is used as the index into this table). This table indicates which 4-color group of the 64 groups in the VGA palette to use for color selection for the group of pixels being written. The idea is that when a mid-frame palette change occurs (or at any time, for that matter), the affected PPU palette in this table is changed to point to where the new palette modifications will be made in the VGA's palette. The corresponding VGA palette entries will also have to be updated appropriately (generally, VGA palette updates will be made in a ring-buffer fashion: a pointer which keeps track of the first available 4 palette entries will be incremented whenever any entries in a 4-color PPU palette are changed).

Basically, this method offers the fastest possible way to render NES graphics, since data is fetched from pattern table memory and written directly to the virtual frame buffer. The number of pixels processed simultaneously can be as high as 8 (with 64-bit integer instructions). However, the # of mid-screen PPU palette modifications possible is limited to 64 (or 32 for PF and 32 for OBJs, if one of the bits in every pixel needs to be used to distinguish a playfield pixel from an object pixel), but multiple consecutive modifications to a single 4-color PPU palette only count as one actual modification.

MMX instruction-based rendering
-------------------------------

With the introduction of the Pentium MMX, the entire x86 architecture not only got a lot more complex, it also got a lot better. MMX instructions are a set of single-instruction, multiple-data, RISC-like instructions. Nearly all the instructions have a very low 1 or 2 clock cycle latency across all x86-class CPUs which support them, hence these instructions are very desirable to use. The instructions work around an 8-element, 64-bit flat register file, which overlaps the FPU's mantissa registers. The BIG deal about using MMX instructions for pixel rendering is that 8 pixels can be operated on simultaneously, provided each pixel is no larger than a byte.

The following assembly-written routine can fully calculate pixel color for 8 horizontally sequential pixels on every loop iteration (the example actually renders the first 4 scanlines of a tile). The pattern tables have already been reorganized so that the bitmap data for 4 scanlines of tile data can be loaded into an MMX register, and used in the most efficient way possible. Pixel data for 4 sequential scanlines under the same horizontal coordinate is stored in a single byte, with the 2 MSBs containing the lowest logical scanline coordinate. Sequential bytes, up to the 8th one, contain the pixel data for every successive horizontal position. Finally, the first 8 bytes of a tile's pattern table data contain the full bitmap data for the first 4 scanlines of the tile, and the next 8 bytes contain the last 4 scanlines.

####################################
;register assignments
;--------------------
;EAX: destination pixel pointer
;EBX: points to the palette to be used for this tile (essentially determined by the attribute table)
;EDX: points to a 32-byte (4 tile scanline) buffer for implementing fine horizontal scroll effects
;ESI: source pointer for 32-pixel bitmap to be loaded from pattern table
;MM4: (8 - fine horizontal scroll value)*8
;MM5: (    fine horizontal scroll value)*8

;fetch 32 pixels from pattern table, organized as 8 horizontal x 4 vertical.
	movq	mm3,[esi]
	mov	ecx,-4		;load negative loop count

;move constants related to color calculation directly into registers. These have to be stored in memory, since MMX instructions don't allow the use of immediate data as an operand.
@1:	movq	mm0,_BFx8	;contains 0BFBFBFBFBFBFBFBFh
	movq	mm1,_FFx8	;contains 0FFFFFFFFFFFFFFFFh
	movq	mm2,_3Fx8	;contains 03F3F3F3F3F3F3F3Fh

;generate masks depending on the magnitude of the 2 MSBs in each packed byte (note that this is a signed comparison).
	pcmpgtb	mm0,mm3
	pcmpgtb	mm1,mm3
	pcmpgtb	mm2,mm3
	psllq	mm3,2		;shift bitmap to access next scanline of pixels

;to perform color lookup, a precalculated palette table is used & ANDed with the resulting masks of the last operation. Since XOR operations are used to combine the results, this requires the elements in the palette table to be XORed with adjacent values, so that they'll be cancelled out at the end of the logic processing here. The required precalculated XOR combination of each color element is shown in the comments below by the corresponding element. Note that each lookup is 8 bytes wide; this requires the same palette data for a single element to be mirrored across all 8 sequential bytes.
	pand	mm0,[ebx+00]	;2^1^0^3
	pand	mm1,[ebx+08]	;2^1^0
	pand	mm2,[ebx+16]	;2^1
	pxor	mm0,[ebx+24]	;2
	pxor	mm1,mm2

;this logic performs shift functionality, in order to implement fine horizontal scrolling. The alternative to this is simply writing 64 bits out to unaligned addresses for fine H scroll values other than zero, but since this can incur large penalties on Intel-based processors, this is generally the preferred way to generate the fine horizontal scroll effect.
	movq	mm2,[edx+ecx*8+32]
	pxor	mm0,mm1
	psrlq	mm2,mm5
	movq	[edx+ecx*8+32],mm0
	psllq	mm0,mm4
	por	mm0,mm2
	movq	[eax],mm0

;loop maintenance
	add	eax,LineLen	;advance pixel pointer to next scanline position
	inc	ecx
	jnz	@1
####################################

Obviously, code will have to be added here to compensate for games that may require scanline-granular PPU rendering, but this won't change the # of instructions in the inner loop. At any rate, the inner loop can execute in just about 14 clock cycles on an Athlon, and that translates into 14/8 = 1.75 clock cycles per fully rendered pixel.
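For reference, here is one way (a hedged C sketch) to repack a standard 16-byte planar NES tile into the 4-scanlines-per-byte format the routine above expects. Whether bitplane 1 occupies the higher bit of each 2-bit pixel is an assumption that must agree with the palette table layout:

####################################
#include <stdint.h>

/* src: 16-byte planar tile (plane 0 then plane 1, bit 7 = leftmost pixel)
   dst: 16-byte repacked tile (8 bytes per 4-scanline group, topmost
        scanline of each group in the 2 MSBs of every byte)              */
void repack_tile(const uint8_t src[16], uint8_t dst[16])
{
    for (int group = 0; group < 2; group++) {       /* scanlines 0-3, 4-7 */
        for (int x = 0; x < 8; x++) {               /* horizontal position */
            uint8_t b = 0;
            for (int y = 0; y < 4; y++) {
                int line = group * 4 + y;
                int p0 = (src[line]     >> (7 - x)) & 1;  /* bitplane 0 */
                int p1 = (src[line + 8] >> (7 - x)) & 1;  /* bitplane 1 */
                b |= (uint8_t)(((p1 << 1) | p0) << (6 - y * 2));
            }
            dst[group * 8 + x] = b;
        }
    }
}
####################################

Since pattern memory is read or written far less often than pixels are rendered (even with CHR-RAM games), this conversion cost is paid rarely, as noted earlier.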
+---------------------------------+
|Merging playfield & object pixels|
+---------------------------------+

The most efficient way to combine playfield & object data into your final rendered frame is to always render the playfield first (or a section of it, in the case of dealing with a split screen), directly to the image buffer itself. At this point, to effectively merge object pixels with the playfield's, each pixel in your image buffer must have an extra 2 bits associated with it: one represents the transparency status of the playfield pixel, and the other does the same for an object pixel (when drawn later). Naturally, after rendering the playfield, the image buffer won't have any pixels with the transparency status for object pixels marked as false. But now, as objects are rendered, the condition on which an actual pixel is drawn depends on these two transparency status bits, the object's own transparency status, and its priority. Starting in order from object 0 (highest priority) up to 63, object bitmaps are "merged" with the playfield, in the fashion that the following few lines of pseudo-code show:

	IF (SrcOBJpixel.xpCond = FALSE) THEN
		IF ((DestPixel.OBJxpCond = TRUE) AND
		    ((DestPixel.PFxpCond = TRUE) OR (SrcOBJpixel.Pri = foreground))) THEN
			DestPixel.data := SrcOBJpixel.data
		FI
		DestPixel.OBJxpCond := FALSE
	FI

So, as you can see, the destination's OBJxpCond is marked as false even if the object's pixel is not meant to be drawn. This is to prevent the pixels of lower-priority (numerically higher-numbered) objects from being drawn in those locations.

This may raise the question, "Why do you render objects in the order 0->63 (effectively requiring 2 bits for transparency status), when you could render them in the opposite direction (which only requires 1 bit for transparency status)?" The answer is because of what happens on a priority clash (see the "PPU pixel priority quirk" section of the "2C02 technical operation" document). Rendering objects in order 0->63 is the only way to emulate this PPU feature properly (and some games DO depend on the functionality of this, as it provides a way to force the playfield to hide foreground-priority object pixels). Otherwise (for 63->0), it would be necessary to merge objects into an image buffer filled with the current transparency color, and then merge playfield data with the buffer as well. Granted, this technique only requires 1 transparency (background priority) status bit per pixel, but since merge operations are slow, and this technique requires many more of them, it is inferior to the aforementioned one.
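Here is the pseudo-code expressed as a C sketch, assuming the two status bits live in the top bits of each frame-buffer byte (the storage trick mentioned under "other tips" below), with the pixel's color confined to the low 6 bits:

####################################
#include <stdint.h>
#include <stdbool.h>

#define PF_XP  0x80  /* set = playfield pixel here is transparent        */
#define OBJ_XP 0x40  /* set = no object pixel has claimed this spot yet  */

/* obj_color is assumed < 0x40 (the 52 NES colors fit in 6 bits) */
void merge_obj_pixel(uint8_t *dest, uint8_t obj_color,
                     bool obj_transparent, bool foreground_priority)
{
    if (!obj_transparent) {
        if ((*dest & OBJ_XP) &&                   /* no higher-priority OBJ yet */
            ((*dest & PF_XP) || foreground_priority))
            *dest = (*dest & PF_XP) | obj_color;  /* draw; keep PF status bit   */
        /* claim the spot even when the pixel isn't drawn, so lower-priority
           objects can't show through (the priority-clash behaviour)            */
        *dest &= (uint8_t)~OBJ_XP;
    }
}
####################################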
other tips
----------

- Depending on your implementation of pixel rendering, you may be able to store the 2 transparency status bits inside the pixel data itself. For example, if only 52 combinations of a rendered pixel are being generated, the upper 2 bits of the pixel's byte can be used for this storage. This may mean that you'll have to mirror your video buffer's palette register RGB information 4 times, but it is otherwise a good idea. For 8-bit color VGA modes, a legacy mask register (3C6h) allows the programmer to mask out any bits of the written pixel data that are unrelated to color generation.

- Don't use branching to avoid drawing a pixel out somewhere. First of all, it only allows you to process 1 pixel at a time, which is slow. Second, CPUs have a hard time predicting branches based on random data (or at minimum, data that produces a branch pattern which is too long to be stored in the CPU's branch target buffer). Finally, sequences of arithmetic and logical operations can be used to merge multiple bytes of data simultaneously (especially with MMX instructions).

- Inline the code in small loops with a constant # of iterations, especially if the loop count is low and the loop is the innermost one. This reduces overhead by avoiding a backwards branch, and the loop & index counters required. For example, when drawing tiles, it would be a good idea to inline the code to draw 8 horizontal pixels.
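For instance, a completely unrolled 8-pixel tile-scanline draw (a sketch, assuming the packed 2-bit pixel format suggested earlier, leftmost pixel in the two MSBs):

####################################
#include <stdint.h>

/* no inner loop counter, no backward branch: all 8 pixels inlined */
static inline void draw_tile_scanline(uint8_t *dst, uint16_t bits,
                                      const uint8_t pal[4])
{
    dst[0] = pal[(bits >> 14) & 3];
    dst[1] = pal[(bits >> 12) & 3];
    dst[2] = pal[(bits >> 10) & 3];
    dst[3] = pal[(bits >>  8) & 3];
    dst[4] = pal[(bits >>  6) & 3];
    dst[5] = pal[(bits >>  4) & 3];
    dst[6] = pal[(bits >>  2) & 3];
    dst[7] = pal[ bits        & 3];
}
####################################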
+-------------------------+
|frame store optimizations|
+-------------------------+

One of the simplest approaches to render emulation is to draw the entire playfield to the video buffer, and then place applicable object data in the buffer afterwards (this makes object/playfield pixel decisions easier to calculate, since the playfield pixels are already rendered). This straightforward approach would also be the most efficient way to deal with rendering NES graphics, were there not certain caveats to using the video buffer. Since I don't know how Windows or Linux programmers do this sort of thing, this topic may not be applicable to them.

- Video frame buffer reading is painfully slow. No matter how fast a video card a computer has, reading video memory is going to be at least 10 (ten!) times slower than writing to it. This would affect the time when objects are being rendered, when the contents of the playfield that underlap the object need to be read in & merged with the object's pixels.

- Writing to random places in video memory is painfully slow. Because modern I/O devices (PCI, AGP) in PCs share address lines with data lines (over the same bus), there's overhead in writing to a random place in video memory. This design was made with data streaming in mind, so that when sequential writing occurs (a pretty common thing), bus lines which would otherwise be wasted on keeping track of a sequentially increasing address can instead be used to carry data. So, a non-sequential transfer of data to the video card could take as much as double the amount of time that a sequential transfer does. This point alone makes rendering the playfield on a per-tile basis (where you're basically only storing 8 sequential pixels at a time for each scanline of the tile) directly to the video buffer one of the worst approaches. Sequential data transfers to the video card are close to optimal at 256 bytes at a time and up.

- Writing to random, unaligned places in video memory is incredibly slow. This is because a merge operation has to take place on the hardware level: a read from the video card (which is already slow), and then the write. However, this operation is only required at the beginning & end of an unaligned, sequential transfer. Thus, unaligned data streaming to the video card carries less of a penalty the larger the transfer is (I measured an 11% additional overhead on an unaligned sequential store of 512 bytes to the video card buffer, and double that for a 256-byte transfer). Video addresses which are divisible by 64 bytes are considered to be aligned.

- Writing to video memory in small (byte-by-byte) store operations is a bad idea. While modern PC CPU/chipset hardware may be able to detect & combine several little sequential stores to the video buffer into full-size ones, there's no guarantee this will happen. Older hardware certainly doesn't do this. And if the small writes aren't being combined together, then guess what? The chipset will perform a merge operation for each data item transferred that isn't full size. Obviously, this could be the worst possible way to send data to the video buffer (well, next to storing small data at random places in video memory).

- Writing to a non-linear frame buffer (LFB) is slow. At least on one card I tested, there was a 333% increase in video buffer write speed after switching from using the legacy one at address 000A0000. I understand that basically any PCI video card has LFB capabilities, but they may be inaccessible due to the card's BIOS, or drivers. I guess that this is really a responsibility of the OS, but either way: use the LFB any way you can.

the virtual frame buffer
------------------------

Now you should see that it's just not a good idea to render graphics directly to the video buffer (although I don't think anyone would do this, anyway). Prior versions of this document discussed using a virtual frame buffer, which was basically a buffer allocated in regular memory used to render graphics to (instead of directly to the video buffer). When the virtual buffer was full, it would then be copied to the video buffer in a large, sequential operation (just the way the video card likes it!). It embarrasses me to admit that this method is actually quite inefficient, and I'd like to clear the air on that issue. To understand why it's inefficient, let me explain a little about how CPU caches work.

If you know how virtual memory works, well, a CPU cache is basically a hardware version of this, although not for a disk drive, but for main system RAM. A CPU caches data on a line-by-line basis. Each line is anywhere from 16 (486) to 32 (Pentium) to 64 (Athlon) bytes in size, and will most likely grow bigger in future CPUs. So, if only a single byte needs to be read from memory, an entire line is actually loaded into the cache (this is why data alignment, and grouping related data fields together in records, is important to guarantee the effectiveness of a CPU's cache). This action also pushes another line out of the cache (ideally the least recently used one), and if it's dirty (modified), it will be written back to main memory.

486's started the on-chip CPU cache trend, with a whole 2K bytes for data and code. 486DX4 models had 8K bytes. Pentiums had 16K, etc. The point is, the size of the cache is basically the size of memory that the CPU can randomly access for next to no clock cycles at all (providing the target address is in there). Of course, this only applies to the L1 cache, since there's some latency involved with accessing the L2 cache, depending on its location (older CPUs never had on-chip L2 cache; it was located on the mainboard). The trick to effective use of the cache is all in how software is written. The best thing to do is to divide software program control up evenly between the individual tasks it performs, so that only a minimum number of cache misses occur when the task is switched, and completely different data areas are being accessed by the program.

the virtual frame buffer caveat
-------------------------------

Let's consider the virtual frame buffer (VFB) model. We start rendering our playfield. Name tables and pattern tables are accessed, and that's fine (the name tables are easily cached, and even some pattern table data gets cached). Then, we store our rendered pixel data to our buffer.
The pixel data is stored to the VFB using a data size of 4 bytes (or otherwise, the biggest size that the processor will allow the programmer to store with). However, the CPU's cache line size is always bigger than this, and therefore the CPU performs a merge operation between the written data and the cache line of the data being written to. Now, here's the first problem: the target of the store operation to the VFB is unlikely to be in the cache. This means that the CPU ends up actually *reading* main memory after your first 4-byte pixel store. Of course, now you can write to this line for free, but main memory access is slow, and considering what we're doing here (which is exclusively store operations), it's kind of ridiculous that the programmer has no way of telling the processor that the merge operation (and moreover the reading of main memory) is unnecessary, since we plan on overwriting all the original data in that particular line (newer x86 processors offer instructions for better management of non-temporal storage). Anyway, you get the idea: after every few stores to the VFB, a new line from the VFB will be read in.

But guess what? This isn't even the worst part of it. As you keep filling the VFB, your CPU's cache overflows, since your CPU's L1 cache is smaller than the VFB you're working on. This means that not only will your VFB rendering eventually push out any lines from the cache which aren't used directly by the render routine (causing lost cycles for even local routines that may need them immediately after the render), but after the render, when you go to copy the VFB to video memory, the entire buffer has to be loaded back into the CPU's cache. Of course, the size of the CPU cache is everything here. Currently however, to my knowledge, the largest L1 cache implemented in a CPU is 64K bytes, and that's only for the Athlon (the P4's is smaller than this). With that said, modern CPUs are simply unfit for this style of rendering.

scanline stores
---------------

By reducing the size of the VFB from full size down to a few scanlines (or even just one), most or all of the caveats mentioned above can be avoided. Since typically a VFB scanline is 256 bytes (in the example for the PPU), this makes the memory requirement small enough to ensure modest performance even on a 486. Of course, this creates a new problem for writing the PPU render engine- tiles can no longer be rendered completely (unless you're using an 8-scanline VFB, but the rest of this topic assumes you're using only a single-scanline VFB). Some overhead caused by only rendering a single scanline of a tile at a time can be avoided by precalculating pointer work for each sequential tile, and storing it in an array, so that the calculations can be reused for the tile's other scanlines. A similar technique can be used for object pointer calculations as well.

Perhaps a possible performance boost obtainable through a careful scanline rendering engine design is that storing rendered pixels directly to the video buffer may be permitted, since all pixels on the scanline are rendered sequentially (and thus can be stored out that way). However, there are conditions that determine the effectiveness of this. First, obviously this could only be done for the tiles on the scanline where objects are not placed; a temporary buffer in memory will still have to be used to make playfield/object decisions for the tiles that do have objects (not to mention, some method of determining the sequence in which objects appear on the scanline is needed).
The second condition depends on alignment. If the PPU's fine scroll offset is 0 (or even 4, if using 32-bit store operations), pixel data can be written out to the video buffer directly with no performance penalty (and also with no leading/trailing tile debris on the ends of the scanline written to video mem). If it is other than 0 or 4, it may be faster to store it unaligned to a scanline VFB, then copy it back out aligned. Another option is to perform byte shifting/rotating on the fetched bitmaps before storing directly to the video buffer (as is done with the MMX example shown above).
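Putting the section together, a hedged sketch of a single-scanline engine: per-tile pointer work is precalculated once per tile row, the line is built in a small VFB, and then sent to video memory as one aligned sequential store. All helper names are hypothetical.

####################################
#include <stdint.h>
#include <string.h>

#define TILES_PER_LINE 33              /* 32 visible tiles + 1 for fine X scroll */

typedef struct {
    const uint8_t *pattern;            /* tile bitmap base (precalculated)  */
    const uint8_t *palette;            /* 4-color palette for this tile     */
} TileRef;

/* hypothetical helpers implemented elsewhere */
void precalc_tile_row(int y, TileRef *row);   /* name/attribute table fetches */
void draw_tile_line(uint8_t *dst, const TileRef *t, int subline);
void merge_objects_on_line(int y, uint8_t *line);

static TileRef row[TILES_PER_LINE];
static uint8_t line_vfb[TILES_PER_LINE * 8];  /* one scanline + scroll slack */

void render_scanline(int y, uint8_t *video, int pitch, int fine_x)
{
    /* pointer work is reused for all 8 sublines of the tile row
       (fine Y scrolling is ignored in this sketch) */
    if ((y & 7) == 0)
        precalc_tile_row(y, row);

    for (int t = 0; t < TILES_PER_LINE; t++)
        draw_tile_line(line_vfb + t * 8, &row[t], y & 7);

    merge_objects_on_line(y, line_vfb + fine_x);

    /* one aligned, sequential 256-byte store: what the video card likes */
    memcpy(video + y * pitch, line_vfb + fine_x, 256);
}
####################################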
+-------------+
|CPU emulation|
+-------------+

6502 opcodes for a particular game must be processed, and a unique task carried out for each one (or at least, for the documented ones). This section discusses efficient implementation of a 6502 engine for the NES.

Basic optimizations
-------------------

- Interpreted (jump table-based) vs. dynamic recompilation (converting 6502 instruction sequences to x86 ones on the fly). Basically, dynamic recompilation has a lot of potential for generating optimal CPU emulation performance. Unfortunately however, these performance benefits can only be produced on modern computers, due to the large amount of data structures that need to be maintained to allow it to work (lots and lots of memory is required as well). On the other hand, jump table-based 6502 instruction processing is more favorable on slower machines, since even on a Pentium, an indirect jump only incurs a 5 clock cycle penalty, provided it's not predicted (which is usually the case; predicted jumps of any type are just one clock). By reducing the size of your processor core (by using branchless algorithms for 6502 effective address calculations, or in other places where it can be beneficial), your core would only need to do indirect jumping to handle the emulation of different instructions- which will only require a look-up table with about 20 or 30 branch targets.

- Weigh the pros and cons of writing a 6502 emulator with an exclusive routine for every instruction carefully. Exclusive routines for all 151 6502 opcodes only make the throughput part of instruction emulation quicker. However, due to the larger code size, it is more likely that performance will be lost on older computers, due to the core's resident size forcing other pieces of code out of the CPU cache.

- Implement a clock cycle counter in your 6502 engine, which will be maintained by every 6502 instruction executed. This counter will mainly be used by the PPU to figure out how timed writes will affect how the output image is rendered. However, if it is also used as a terminal counter, then when the count expires, program control can be transferred to a different part of the 6502 engine to emulate the behaviour of an event that would normally occur on the NES at that exact clock cycle (i.e., a timed IRQ event). Also, don't forget that you can manage any number of "virtual cycle counters" without ever having to make the CPU core maintain more than one physical one. NES hardware may have several IRQ-generating counters going simultaneously, but the order in which each will cause an IRQ is always known to the emulator, which is why the cycle count register only has to be programmed with the count value of the next IRQ to occur.

- As 6502 instructions usually require the P (processor status, or flags) register to be updated after an ALU operation, note that the x86's ALU instructions update its flags register in a similar manner. Therefore, after emulating the ALU behaviour of a 6502 instruction with an x86 one, use instructions like "LAHF" or "SETcc" to acquire the status of the sign, zero, carry, and overflow condition codes. Furthermore, have your emulator store the 6502 flags in the format that they're stored in on the x86 CPU. This way, the flags do not have to be reformatted, thus saving time. The only times the flags have to be converted to/from the 6502 order are when the 6502 instructions PHP, PLP, BRK #xx, RTI, and hardware interrupts are executed. Since these happen much less often than the more common arithmetic and logical instructions, it's more efficient to handle the flags this way.

- Use the x86's registers to store all the 6502's internal registers (including a cycle counter). There is enough room to leave 3 general-purpose registers for temporary use. However, don't go out of your way to do this; leaving some commonly used emulation variables in memory only incurs a small penalty, if any (usually a simple load from memory can execute in a single clock).

Advanced optimizations
----------------------

- Dealing with paged/bankswitched memory. Because the 6502's memory map is nonlinear, when the 6502 engine needs to read/write memory from/to a part of the NES's memory map, use part of the target address to index into a table of pointers. This table of pointers (hereafter, the memory map translation table) will represent the architecture of the NES memory map (there will have to be 2 separate tables of this type, for handling reads & writes from/to NES memory). Each entry in the table represents a linear page (or bank) in the NES's memory map. Generally, this table will have to have a page/bank granularity as small as 2K (i.e., 32 entries), to emulate the mirroring behaviour of the NES's internal RAM properly. A sketch of such a table appears below.

Reserve a single bit in each pointer entry to indicate whether the target page pointer points directly to a base memory address (in the x86 memory map), or to an x86 subroutine (used for emulating hardware behaviour when access to that address occurs). If the entry is a subroutine, then program control should be transferred to the pointer's address. Otherwise, no branch should occur, and the code that stores/loads the data via the pointer base (the original address will be used as the offset into the page) should be inlined within the opcode's code body. Also remember that 6502 opcodes that work directly with stack or zero-page memory can avoid doing a look-up into the memory map translation table, since these areas are hardwired to the NES's internal 2K RAM.

- Fetching instructions. Because the 6502 is always fetching instructions, and in a sequential manner, it is more efficient to maintain the 6502's PC as a 32-bit IPC register (Indexed PC) which points to the physical address at which the instructions reside. This avoids having to test the PC's address (with the memory map pointer table) every time an instruction byte needs to be fetched from memory.

- Handling program control transfers. When the IPC must be modified via an absolute program control transfer instruction, using the new 16-bit PC address, the memory map page table is consulted to load the new base address for IPC (this base address is also saved for later). The PC value is then added to IPC to offset it into the page. When a PC value must be saved to the stack, the saved base address (mentioned previously) is subtracted from IPC to calculate the current PC value. Note that support for executing 6502 instructions residing in a hardware-register-mapped part of memory will require additional code.
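A C equivalent of the read-side translation table (a sketch: the x86 version described above packs the subroutine flag into a reserved pointer bit and inlines the non-branching path into each opcode body; a struct with two fields is the portable stand-in here):

####################################
#include <stdint.h>

#define PAGE_SHIFT 11                       /* 2K granularity -> 32 entries */

typedef uint8_t (*ReadHandler)(uint16_t addr);

typedef struct {
    const uint8_t *base;     /* NULL if this page is hardware-mapped */
    ReadHandler    handler;  /* used when base == NULL               */
} ReadPage;

static ReadPage read_map[32];  /* a second, separate table handles writes */

uint8_t cpu_read(uint16_t addr)
{
    ReadPage *p = &read_map[addr >> PAGE_SHIFT];
    if (p->base)                                          /* common case:  */
        return p->base[addr & ((1 << PAGE_SHIFT) - 1)];   /* plain memory  */
    return p->handler(addr);                              /* HW register   */
}
####################################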
- Detecting when the PC exceeds a page/bank boundary. Because of the way IPC is implemented, when IPC crosses a page boundary, instead of the next virtual page being executed, the next physical page is executed. This will result in the CPU exhibiting behaviour the game never intended, and may compromise the stability of the game being emulated. This problem cannot be overcome, but it can be trapped. By placing 3 of the same special opcode at the end of these banks in physical memory, the page-boundary-crossing condition can be detected. Now, this doesn't give much hope towards running games that may potentially do this, but if a game did this anyway, it would probably just hang without much explanation. By trapping this occurrence, you'll know exactly what happened.

EOF